Problem Statement¶
Context¶
AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
Objective¶
To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segments of customers to target more.
Data Dictionary¶
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIPCode: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional)
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
Problem Understanding¶
- The business wants to convert liability customers into asset customers; the success metric is getting customers who have not yet applied for personal loans to do so. For that, it is important to:
- Identify customers with the potential to apply for personal loans, which calls for a model capable of prediction
- Understand the attributes of customers who apply for personal loans, via model interpretability and data exploration
- Find customers who have not bought personal loans but appear to be good candidates for conversion
Importing necessary libraries¶
# Installing the libraries with the specified version.
%pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
Note:
After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.
On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the code above ensures that all necessary libraries and their dependencies are installed to successfully execute the code in this notebook.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
)
# to suppress unnecessary warnings
import warnings
warnings.filterwarnings("ignore")
Loading the dataset¶
# loading the dataset using pandas
personal_loan = pd.read_csv("./Loan_Modelling.csv")
# making a copy of the original data
data = personal_loan.copy()
Data Overview¶
# view first 5 rows of the data
data.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
# view last 5 rows of the data
data.tail()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
Number of rows and columns¶
# shape method gives us a tuple of (number of rows, number of columns)
data.shape
(5000, 14)
Exploring data types and null values¶
# info() gives us a summary of the dataset including the data types and non-null counts
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
Overall Statistical Summary¶
# describe() shows the statistical summary of numerical columns in the dataset, .T takes the transpose for better readability
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
- Observations:
- The dataset has 5000 observations and 14 columns (13 features plus the ID column)
- info() shows that every column is of a numerical type; there is no categorical dtype
- It can be safely assumed that no null values are present; the output of isnull() below confirms the same
- From the data dictionary and the head()/tail() views, some columns are categorical but their values are represented numerically:
- ZIPCode
- Education
- Securities_Account
- CD_Account
- Online
- CreditCard
- Personal_Loan (categorical) is the dependent variable to be predicted, taking a 0 or 1 value (No/Yes)
- The minimum experience is -3, which looks like an incorrect value
- The average age is 45
- The following variables have a wide spread (a quick skewness check follows below):
- Income ranges from 8k to 224k
- Average monthly credit card spending ranges from 0 to 10k
- Mortgage ranges from 0 to 635k
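All three wide-spread variables hint at right skew, which a quick numeric check can make concrete. A minimal sketch, assuming the data frame loaded above (the 1.0 rule of thumb is an illustration, not part of the original analysis):
# skewness of the wide-spread variables; values well above 1 indicate strong right skew
print(data[["Income", "CCAvg", "Mortgage"]].skew())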
Data Preprocessing¶
- Missing value treatment
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)
Dropping extraneous or redundant features¶
# dropping the 'ID' column as it is essentially just like the index
data = data.drop(['ID'], axis=1)
Data Sanity Check¶
# checking for duplicate rows
data.duplicated().sum()
np.int64(0)
# checking null values
data.isnull().sum()
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64
Check Incorrect Values¶
# find out each unique value in 'Experience' column
data["Experience"].unique()
array([ 1, 19, 15, 9, 8, 13, 27, 24, 10, 39, 5, 23, 32, 41, 30, 14, 18,
21, 28, 31, 11, 16, 20, 35, 6, 25, 7, 12, 26, 37, 17, 2, 36, 29,
3, 22, -1, 34, 0, 38, 40, 33, 4, -2, 42, -3, 43])
# check experience less than 0
data[data["Experience"] < 0]["Experience"].unique()
array([-1, -2, -3])
# replace negative values in the 'Experience' column with their absolute values
data["Experience"] = data["Experience"].abs()
## Converting the data type of categorical features to 'category' using the astype() method
categorical_variable = ["Education","Personal_Loan","Securities_Account","CD_Account","Online","CreditCard","ZIPCode"]
data[categorical_variable] = data[categorical_variable].astype("category")
Checking other categorical variables and variables with a low number of distinct values for any strange or anomalous values¶
print(data["Education"].unique())
print(data["Personal_Loan"].unique())
print(data["Securities_Account"].unique())
print(data["CD_Account"].unique())
print(data["Online"].unique())
print(data["CreditCard"].unique())
print(data["Family"].unique())
[1, 2, 3]
Categories (3, int64): [1, 2, 3]
[0, 1]
Categories (2, int64): [0, 1]
[1, 0]
Categories (2, int64): [0, 1]
[0, 1]
Categories (2, int64): [0, 1]
[0, 1]
Categories (2, int64): [0, 1]
[0, 1]
Categories (2, int64): [0, 1]
[4 3 1 2]
- There are some negative values in the 'Experience' column. Since experience can never be negative, these are assumed to be incorrectly entered values
- Since the values make sense as positives, they were replaced with their absolute values
- The other columns show proper values
Outlier Detection¶
# find 25th and 75th percentiles which are known as q1 and q3 respectively using quantile() method
q1 = data.select_dtypes(include=["number"]).quantile(0.25)
q3 = data.select_dtypes(include=["number"]).quantile(0.75)
# the interquartile range (IQR) is calculated as the difference between q3 and q1
iqr = q3 - q1
# finding the lower and upper limit for all numerical variables, which will dictate whether a value is an outlier or not
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
upper_bound
Age            85.0
Experience     60.0
Income        186.5
Family          6.0
CCAvg           5.2
Mortgage      252.5
dtype: float64
lower_bound
Age             5.0
Experience    -20.0
Income        -49.5
Family         -2.0
CCAvg          -2.0
Mortgage     -151.5
dtype: float64
# percentage of observations flagged as outliers (outside either bound), per numerical column
((data.select_dtypes(include=["number"]) < lower_bound) | (data.select_dtypes(include=["number"]) > upper_bound)).sum() / len(data) * 100
Age           0.00
Experience    0.00
Income        1.92
Family        0.00
CCAvg         6.48
Mortgage      5.82
dtype: float64
# percentage of low-side outliers (below the lower bound) only
(data.select_dtypes(include=["number"]) < lower_bound).sum() / len(data) * 100
Age           0.0
Experience    0.0
Income        0.0
Family        0.0
CCAvg         0.0
Mortgage      0.0
dtype: float64
# percentage of high-side outliers (above the upper bound) only
(data.select_dtypes(include=["number"]) > upper_bound).sum() / len(data) * 100
Age           0.00
Experience    0.00
Income        1.92
Family        0.00
CCAvg         6.48
Mortgage      5.82
dtype: float64
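No outlier treatment is applied in this notebook (the observations later conclude the values are plausible), but if capping were ever required, the IQR bounds computed above could be reused directly. A hypothetical winsorization sketch, operating on a copy so data is untouched:
# hypothetical: clip numerical columns to the IQR bounds computed above (illustration only)
num_cols = data.select_dtypes(include=["number"]).columns
capped = data.copy()
capped[num_cols] = capped[num_cols].clip(lower=lower_bound, upper=upper_bound, axis=1)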
Data Preparation for Modeling¶
# Experience is almost completely correlated with Age, hence it can be dropped
# X represents the independent variables, Y represents the target variable
X = data.drop(["Personal_Loan", "Experience"], axis=1)
Y = data["Personal_Loan"]
# performing one-hot encoding on categorical variables
X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)
# performing type conversion for the independent variable dataframe
X = X.astype(float)
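One-hot encoding ZIPCode, which has hundreds of distinct values, inflates the feature count considerably. A quick check sketch (the resulting 477-column width matches the feature-importance table shown later in the notebook):
# the ZIPCode dummies dominate the feature count after encoding
print(X.shape)  # (5000, 477)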
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
# check percentage of response variable class distribution in training and testing samples
print(f"% Distribution of response variable in training set: {y_train.value_counts(normalize=True)*100}")
print(f"% Distribution of response variable in test set: {y_test.value_counts(normalize=True)*100}")
% Distribution of response variable in training set: Personal_Loan
0    90.542857
1     9.457143
Name: proportion, dtype: float64
% Distribution of response variable in test set: Personal_Loan
0    90.066667
1     9.933333
Name: proportion, dtype: float64
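The split above is purely random, so the class ratios in the two samples come out close but not identical. If an exactly preserved ratio is preferred, train_test_split accepts a stratify argument; a sketch of that alternative (not the split used in this notebook):
# alternative: a stratified split keeps the Personal_Loan class ratio identical in train and test
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y
)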
Observations & Sanity Checks
Value Types:
- There are no duplicate rows
- No missing values
- There were some negative values in the Experince column which appear to be a typing mistake. They were corrected accordingly
Feature Modification:
- There are categorical variables whose values have been represented as numerical and are being interpreted as numerical. Their value type has been converted to categorical
- 'ID' column represented a redundant feature which was essentially just an index/serial number starting from 1. It was removed
Outliers:
- It seems all outliers lie above Q3, that is the outlier nature is towards values greater than expected
- No values below Q1 exist for any of the numerical variables
- Outliers are present for following features:
- Income
- CCAvg
- Mortgage
- Outlier values also don't seem to be unreasonable and do not require treatment.
Data Preparation for Modelling:
- The independent and response variables have been separated
- One-hot encoding for a couple of categorical variables has been performed
- Train-test split was performed with a test sample size of 30%
- Type conversion to ensure consistent independent variable data type
- Verification of equal distribution of the classes in both training and testing samples
Exploratory Data Analysis¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions are listed below to help approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, beyond the questions mentioned below, should also be done.
Questions:
- What is the distribution of the mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
- How many customers have credit cards?
- What are the attributes that have a strong correlation with the target attribute (personal loan)?
- How does a customer's interest in purchasing a loan vary with their age?
- How does a customer's interest in purchasing a loan vary with their education?
Univariate Analysis¶
def numerical_plots(data, feature, figsize=(12, 7)):
"""
Function to plot histogram and boxplot together for a given feature.
data: dataset/dataframe
feature: variable, column from the dataset
figsize: size of figure (default (12,7))
"""
# Following code creates 2 subplots in a single column
# nrows = Number of rows in the subplot, which is 2 here
# sharex = True; x-axis will be the same for both plots
# gridspec_kw = height ratios for both plots; boxplot will be smaller than histogram
# f2 = figure object
# sub_boxplot, sub_histplot = axes objects for boxplot and histogram respectively.
# First is for boxplot and second is for histogram
f2, (sub_boxplot, sub_histplot) = plt.subplots(nrows=2,sharex=True,gridspec_kw={"height_ratios": (0.40, 0.60)},figsize=figsize)
# ax= specifies the subplot where the plot will be drawn
sns.boxplot(data=data, x=feature, ax=sub_boxplot, showmeans=True, color="yellow")
sns.histplot(data=data, x=feature, kde=True, ax=sub_histplot, palette="magma")
sub_histplot.axvline(data[feature].mean(), color="red", linestyle="--") # Add mean to the histogram
sub_histplot.axvline(data[feature].median(), color="black", linestyle="-") # Add median to the histogram
# function to create labeled barplots
def barplots(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(data=data,x=feature,palette="Paired",order=data[feature].value_counts().index[:n].sort_values())
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(100 * p.get_height() / total) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(label,(x, y),ha="center",va="center",size=12,xytext=(0, 5),textcoords="offset points",) # annotate the percentage
plt.show() # show the plot
Plot for Age¶
# histogram and boxplot for 'Age' column
numerical_plots(data, "Age")
Plot for Experience¶
# histogram and boxplot for 'Experience' column
numerical_plots(data, "Experience")
Plot for Income¶
# histogram and boxplot for 'Income' column
numerical_plots(data, "Income")
Plot for CCAvg¶
# histogram and boxplot for 'CCAvg' column
numerical_plots(data, "CCAvg")
Plot for Mortgage¶
# histogram and boxplot for 'Mortgage' column
numerical_plots(data, "Mortgage")
Plot for Family¶
# Following cells now show barplots/countplots for each variable and show the percentages at the top of each value/category
barplots(data, "Family", perc=True)
Plot for Education¶
barplots(data, "Education", perc=True)
Plot for Securities_Account¶
barplots(data, "Securities_Account", perc=True)
Plot for CD_Account¶
barplots(data, "CD_Account", perc=True)
Plot for Online¶
barplots(data, "Online", perc=True)
Plot for CreditCard¶
barplots(data, "CreditCard", perc=True)
Plot for ZIPCode¶
barplots(data, "ZIPCode", perc=True, n=20)
Observations:¶
- Age: Symmetric distribution, with mean and median around 45 and no outliers
- Experience: Also symmetric, with mean and median around 20. The third quartile is around 30, and the distribution tapers off sharply around 40
- Income: Right skewed with many outliers on the high side (many high-income customers). The mean is approximately 74, the median is 64, and the maximum value is 224
- CCAvg (average credit card spending): Also right skewed, with high outliers present. The maximum is 10, the mean is about 1.9, and the median is 1.5
- Mortgage: Heavily right skewed. The median is zero, meaning at least 50% of customers have no mortgage. The mean is around 56, there are many outliers, and the maximum value is over 600
- Family: Families of size 1 are the largest group
- Education: Undergraduates are the largest group
- Securities_Account: About 90 percent of customers do not have a securities account
- CD_Account: Around 94 percent of customers do not have a certificate of deposit account
- Online: Around 40 percent of customers do not use online banking
- CreditCard: About 71 percent of customers do not have a credit card issued by another bank (roughly 29 percent do)
- ZIPCode: Only 3 of the observed ZIP codes individually account for more than 2 percent of customers
Bivariate Analysis¶
def cat_vs_response_plot(data, input, response, figsize=(12, 5)):
"""
Plot categorical variable against response variable in a stacked bar chart
data = dataframe
input = categorical variable
response = target variable
"""
f2, (tab_plot, tab_normalized_plot) = plt.subplots(ncols=2,figsize=figsize)
tab = pd.crosstab(data[input], data[response], margins=True)
tab_normalized = pd.crosstab(data[input], data[response], normalize = 'index')
print(tab)
tab.plot(kind="bar", stacked=True, ax=tab_plot)
tab_normalized.plot(kind="bar", stacked=True, ax=tab_normalized_plot)
plt.legend(loc="upper right", bbox_to_anchor=(1, 1))
plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(data=data[data[target] == target_uniq[0]],x=predictor,kde=True,ax=axs[0, 0],color="teal",stat="density")
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(data=data[data[target] == target_uniq[1]],x=predictor,kde=True,ax=axs[0, 1],color="orange",stat="density")
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(data=data,x=target,y=predictor,ax=axs[1, 1],showfliers=False,palette="gist_rainbow")
plt.tight_layout()
plt.show()
Computing correlations and plotting a heatmap¶
plt.figure(figsize=(12, 7))
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
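The heatmap covers only numeric columns, so Personal_Loan (now a category) is excluded. One of the guiding questions asks which attributes correlate most strongly with the target; a quick sketch that casts the 0/1 target back to int for the check (an illustrative addition, not one of the original cells):
# correlation of each numeric feature with the 0/1 target, strongest first
target_corr = data.select_dtypes(include=["number"]).corrwith(data["Personal_Loan"].astype(int))
print(target_corr.sort_values(ascending=False))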
Exploring relationship of "Personal Loan" with some categorical variables¶
cat_vs_response_plot(data, "Education", "Personal_Loan")
Personal_Loan     0    1   All
Education
1              2003   93  2096
2              1221  182  1403
3              1296  205  1501
All            4520  480  5000
Personal_Loan vs Family¶
cat_vs_response_plot(data, 'Family', 'Personal_Loan')
Personal_Loan     0    1   All
Family
1              1365  107  1472
2              1190  106  1296
3               877  133  1010
4              1088  134  1222
All            4520  480  5000
Personal_Loan vs Securities_Account¶
cat_vs_response_plot(data, 'Securities_Account', 'Personal_Loan')
Personal_Loan          0    1   All
Securities_Account
0                   4058  420  4478
1                    462   60   522
All                 4520  480  5000
Personal_Loan vs CD_Account¶
cat_vs_response_plot(data, 'CD_Account', 'Personal_Loan')
Personal_Loan     0    1   All
CD_Account
0              4358  340  4698
1               162  140   302
All            4520  480  5000
Personal_Loan vs Online¶
cat_vs_response_plot(data, 'Online', 'Personal_Loan')
Personal_Loan     0    1   All
Online
0              1827  189  2016
1              2693  291  2984
All            4520  480  5000
Personal_Loan vs CreditCard¶
cat_vs_response_plot(data, 'CreditCard', 'Personal_Loan')
Personal_Loan     0    1   All
CreditCard
0              3193  337  3530
1              1327  143  1470
All            4520  480  5000
Personal_Loan vs ZIPCode¶
top_10_zips = data['ZIPCode'].value_counts().head(10).index
df_filtered = data[data['ZIPCode'].isin(top_10_zips)]
cat_vs_response_plot(df_filtered, 'ZIPCode', 'Personal_Loan', figsize=(20, 5))
Personal_Loan    0   1  All
ZIPCode
90095           63   8   71
91320           51   2   53
91711           48   4   52
92037           49   5   54
93106           53   4   57
93943           50   4   54
94025           48   4   52
94305          114  13  127
94720          150  19  169
95616          110   6  116
All            736  69  805
Let's check how a customer's interest in purchasing a loan varies with their age¶
distribution_plot_wrt_target(data, "Age", "Personal_Loan")
Personal Loan vs Experience¶
distribution_plot_wrt_target(data, "Experience", "Personal_Loan")
Personal Loan vs Income¶
distribution_plot_wrt_target(data, "Income", "Personal_Loan")
Personal Loan vs CCAvg¶
distribution_plot_wrt_target(data, "CCAvg", "Personal_Loan")
Correlation:
- Age and Experience are almost perfectly correlated
- Income and CCAvg (credit card spending) are highly correlated
- Some weak correlation exists between Income and Mortgage
Response Variable against Categorical Variables:
Personal Loan vs Education:
- Undergraduates show the lowest loan-uptake rate
- Graduate and advanced-education customers take loans more often
Personal Loan vs Family:
- Larger families show a higher loan rate in percentage terms
Personal Loan vs Securities Account, Online, Credit Card (for other banks):
- No obvious relationship is visible in the graphs
Personal Loan vs Certificate of Deposit:
- Customers with a certificate of deposit account are more likely to take out loans
Personal Loan vs ZIPCode:
- Also no apparent relationship. Three ZIP code areas simply have more customers, so more loans come from there, but there is no special relation between loans and the ZIP code itself
Response Variable against Numerical Variables:
Personal Loan vs Age:
- Younger customers (under 40) seem more likely to go for loans
- The variance is smaller among loan takers, again suggesting that younger customers are the ones likely to apply for personal loans
Personal Loan vs Experience:
- Similar to the distribution for Age
Personal Loan vs Income:
- From the plots, higher-income customers appear more inclined to take out loans
- Many outliers are present among non-loan-takers (high-income people not taking loans)
- There are peaks for loan applicants at incomes around 130 and 175, after which there is a sharp decline
- High-income customers who have not taken out loans look like good targets for marketing efforts
Personal Loan vs CCAvg:
- Heavily skewed; high spenders are more likely to take out loans
- There are outliers among customers who are not loan applicants
- These outliers are also recommended targets for marketing efforts
Other Observations¶
sns.pairplot(data, hue="Personal_Loan", vars=["Age", "Income", "CCAvg", "Mortgage"], palette="Set2");
numerical_columns = ['Age', 'Income', 'CCAvg', 'Mortgage']
categorical_columns = ['Education', 'Family', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']
fig, axes = plt.subplots(nrows=6, ncols=4, figsize=(30, 30))
# loop over each categorical/numerical pair and draw a violin plot split by Personal_Loan
for i, cat in enumerate(categorical_columns):
for j, num in enumerate(numerical_columns):
sns.violinplot(data=data, x=cat, y=num, hue='Personal_Loan', palette='Set3', ax=axes[i, j])
plt.tight_layout()
plt.show()
Other Observations:
- Customers with high income and high credit card spending are very likely to apply for a loan, across all categories
- Some observations on undergraduates/non-graduates:
- The high-income ones are mostly liability customers
- High credit card spenders among them are also liability customers
- Some high-mortgage customers among them are also liability customers
- Some observations on small families (1-2 members):
- The high-income ones are liability customers
- High credit card spenders are also liability customers
- Some high-mortgage customers are also liability customers
Model Building¶
Model Evaluation Criterion¶
Algorithm Selection
- The business objective indicates that this is a classification problem: given customer data, the requirement is to predict whether a liability customer will take out a personal loan or not.
- So there are two classes, Yes/No (1/0), and the task is to predict which class each observation/customer, described by various features, belongs to.
- This is not a regression problem, as we are not predicting a continuous value. It is also not (so far) an unsupervised learning problem where segmentation via something like clustering is needed to understand different groups.
- An appropriate model for predicting classes here is the Decision Tree.
Model Evaluation
The objective is to maximize revenue by converting liability customers to asset customers via personal loan applications. A few inferences follow:
- Having customers convert to personal loan applicants is highly important from a revenue point of view. Losing a potential customer because they were not targeted by marketing efforts is assumed to be very costly.
- The business will spend effort reaching out to potential applicants and offering them incentives to convert. This expense is not comparable to the previous point: the marketing cost spent on one customer who did not convert is far lighter than the loss of a customer who would have converted but was never reached.
Based on the above points, we will create different models, which will be evaluated on the following criteria:
- Performance Metrics
- The model will be trained on training data and used for prediction on test data. The model should generalize well (not lose much performance between training and test sets)
- Recall is the most important metric, as it penalizes False Negatives (customers who could have converted but were incorrectly predicted as unlikely to convert). Low recall represents potential asset customers, and hence revenue, that were lost. Recall needs to be very high, considering that conversions make up only about 9% of the data; see the toy example after this list
- Precision penalizes False Positives (customers the model flagged as good conversion candidates who actually were not). This represents wasted marketing effort and cost. Precision needs to be moderately high, but it is not the primary metric to judge by
- F1-Score combines Recall and Precision. Essentially, a high Recall together with a comparably high F1-Score is good enough for the purposes of model selection
- Complexity
- Model depth: the depth of the tree should not be very large. Given two models with the same performance metrics, the shallower one is preferred
- Maximum number of leaf nodes: the node and leaf counts should not be very high. Given two models with similar performance metrics, the one with fewer nodes is preferred
- Reduced complexity helps with better visualization and interpretation of the model and its key drivers
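As a concrete illustration of the recall emphasis, consider a toy computation (invented numbers, purely illustrative):
# toy example: 10 customers, 3 true converters; the model catches 2 of them and wastes 1 contact
from sklearn.metrics import precision_score, recall_score
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]
print(recall_score(y_true, y_pred))     # 0.67 -> one converting customer was missed (costly)
print(precision_score(y_true, y_pred))  # 0.67 -> one marketing contact was wasted (cheaper)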
The following functions have been created to calculate the different metrics and the confusion matrix, to avoid repetition for each model:
- The classification_performance_metrics function will be used to check model performance.
- The classification_confusion_matrix function will be used to plot the confusion matrix.
- The model_complexity function will be used to check the complexity of the model.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def classification_performance_metrics(model, predictors, target):
"""
give accuracy, recall, precision and f1-score for a classification model
model: classifier - decision trees
predictors: independent/input variables
target: dependent/output/response variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
performance_metrics = pd.DataFrame({"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},index=[0])
return performance_metrics
def classification_confusion_matrix(model, predictors, target):
"""
Gives confusion_matrix
model: classifier - decision trees
predictors: independent/input variables
target: dependent/output/response variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]for item in cm.flatten()]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
def model_complexity(trained_model):
"""
Function to get the model complexity parameters
trained_model: decision tree classifier model
"""
depth = trained_model.get_depth() # depth of the tree
n_leaves = trained_model.get_n_leaves() # number of leaves in the tree
node_count = trained_model.tree_.node_count # number of nodes in the tree
model_complexity_df = pd.DataFrame({'Depth': depth, 'Number of leaves': n_leaves, 'Number of nodes': node_count}, index=[0])
return model_complexity_df
Decision Tree (default)¶
In this section, a decision tree is built with default sklearn parameters.
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
Checking model performance on training data¶
classification_confusion_matrix(model, X_train, y_train)
dtree1_perf_train = classification_performance_metrics(model, X_train, y_train)
dtree1_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
dtree1_complexity = model_complexity(model)
dtree1_complexity
| Depth | Number of leaves | Number of nodes | |
|---|---|---|---|
| 0 | 15 | 43 | 85 |
Visualizing the Decision Tree¶
feature_names = list(X_train.columns)
print(feature_names)
['Age', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode_90007', 'ZIPCode_90009', 'ZIPCode_90011', 'ZIPCode_90016', 'ZIPCode_90018', 'ZIPCode_90019', 'ZIPCode_90024', 'ZIPCode_90025', 'ZIPCode_90027', 'ZIPCode_90028', 'ZIPCode_90029', 'ZIPCode_90032', 'ZIPCode_90033', 'ZIPCode_90034', 'ZIPCode_90035', 'ZIPCode_90036', 'ZIPCode_90037', 'ZIPCode_90041', 'ZIPCode_90044', 'ZIPCode_90045', 'ZIPCode_90048', 'ZIPCode_90049', 'ZIPCode_90057', 'ZIPCode_90058', 'ZIPCode_90059', 'ZIPCode_90064', 'ZIPCode_90065', 'ZIPCode_90066', 'ZIPCode_90068', 'ZIPCode_90071', 'ZIPCode_90073', 'ZIPCode_90086', 'ZIPCode_90089', 'ZIPCode_90095', 'ZIPCode_90210', 'ZIPCode_90212', 'ZIPCode_90230', 'ZIPCode_90232', 'ZIPCode_90245', 'ZIPCode_90250', 'ZIPCode_90254', 'ZIPCode_90266', 'ZIPCode_90272', 'ZIPCode_90274', 'ZIPCode_90275', 'ZIPCode_90277', 'ZIPCode_90280', 'ZIPCode_90291', 'ZIPCode_90304', 'ZIPCode_90401', 'ZIPCode_90404', 'ZIPCode_90405', 'ZIPCode_90502', 'ZIPCode_90503', 'ZIPCode_90504', 'ZIPCode_90505', 'ZIPCode_90509', 'ZIPCode_90601', 'ZIPCode_90623', 'ZIPCode_90630', 'ZIPCode_90638', 'ZIPCode_90639', 'ZIPCode_90640', 'ZIPCode_90650', 'ZIPCode_90717', 'ZIPCode_90720', 'ZIPCode_90740', 'ZIPCode_90745', 'ZIPCode_90747', 'ZIPCode_90755', 'ZIPCode_90813', 'ZIPCode_90840', 'ZIPCode_91006', 'ZIPCode_91007', 'ZIPCode_91016', 'ZIPCode_91024', 'ZIPCode_91030', 'ZIPCode_91040', 'ZIPCode_91101', 'ZIPCode_91103', 'ZIPCode_91105', 'ZIPCode_91107', 'ZIPCode_91109', 'ZIPCode_91116', 'ZIPCode_91125', 'ZIPCode_91129', 'ZIPCode_91203', 'ZIPCode_91207', 'ZIPCode_91301', 'ZIPCode_91302', 'ZIPCode_91304', 'ZIPCode_91311', 'ZIPCode_91320', 'ZIPCode_91326', 'ZIPCode_91330', 'ZIPCode_91335', 'ZIPCode_91342', 'ZIPCode_91343', 'ZIPCode_91345', 'ZIPCode_91355', 'ZIPCode_91360', 'ZIPCode_91361', 'ZIPCode_91365', 'ZIPCode_91367', 'ZIPCode_91380', 'ZIPCode_91401', 'ZIPCode_91423', 'ZIPCode_91604', 'ZIPCode_91605', 'ZIPCode_91614', 'ZIPCode_91706', 'ZIPCode_91709', 'ZIPCode_91710', 'ZIPCode_91711', 'ZIPCode_91730', 'ZIPCode_91741', 'ZIPCode_91745', 'ZIPCode_91754', 'ZIPCode_91763', 'ZIPCode_91765', 'ZIPCode_91768', 'ZIPCode_91770', 'ZIPCode_91773', 'ZIPCode_91775', 'ZIPCode_91784', 'ZIPCode_91791', 'ZIPCode_91801', 'ZIPCode_91902', 'ZIPCode_91910', 'ZIPCode_91911', 'ZIPCode_91941', 'ZIPCode_91942', 'ZIPCode_91950', 'ZIPCode_92007', 'ZIPCode_92008', 'ZIPCode_92009', 'ZIPCode_92024', 'ZIPCode_92028', 'ZIPCode_92029', 'ZIPCode_92037', 'ZIPCode_92038', 'ZIPCode_92054', 'ZIPCode_92056', 'ZIPCode_92064', 'ZIPCode_92068', 'ZIPCode_92069', 'ZIPCode_92084', 'ZIPCode_92093', 'ZIPCode_92096', 'ZIPCode_92101', 'ZIPCode_92103', 'ZIPCode_92104', 'ZIPCode_92106', 'ZIPCode_92109', 'ZIPCode_92110', 'ZIPCode_92115', 'ZIPCode_92116', 'ZIPCode_92120', 'ZIPCode_92121', 'ZIPCode_92122', 'ZIPCode_92123', 'ZIPCode_92124', 'ZIPCode_92126', 'ZIPCode_92129', 'ZIPCode_92130', 'ZIPCode_92131', 'ZIPCode_92152', 'ZIPCode_92154', 'ZIPCode_92161', 'ZIPCode_92173', 'ZIPCode_92177', 'ZIPCode_92182', 'ZIPCode_92192', 'ZIPCode_92220', 'ZIPCode_92251', 'ZIPCode_92325', 'ZIPCode_92333', 'ZIPCode_92346', 'ZIPCode_92350', 'ZIPCode_92354', 'ZIPCode_92373', 'ZIPCode_92374', 'ZIPCode_92399', 'ZIPCode_92407', 'ZIPCode_92507', 'ZIPCode_92518', 'ZIPCode_92521', 'ZIPCode_92606', 'ZIPCode_92612', 'ZIPCode_92614', 'ZIPCode_92624', 'ZIPCode_92626', 'ZIPCode_92630', 'ZIPCode_92634', 'ZIPCode_92646', 'ZIPCode_92647', 'ZIPCode_92648', 'ZIPCode_92653', 'ZIPCode_92660', 'ZIPCode_92661', 'ZIPCode_92672', 'ZIPCode_92673', 
'ZIPCode_92675', 'ZIPCode_92677', 'ZIPCode_92691', 'ZIPCode_92692', 'ZIPCode_92694', 'ZIPCode_92697', 'ZIPCode_92703', 'ZIPCode_92704', 'ZIPCode_92705', 'ZIPCode_92709', 'ZIPCode_92717', 'ZIPCode_92735', 'ZIPCode_92780', 'ZIPCode_92806', 'ZIPCode_92807', 'ZIPCode_92821', 'ZIPCode_92831', 'ZIPCode_92833', 'ZIPCode_92834', 'ZIPCode_92835', 'ZIPCode_92843', 'ZIPCode_92866', 'ZIPCode_92867', 'ZIPCode_92868', 'ZIPCode_92870', 'ZIPCode_92886', 'ZIPCode_93003', 'ZIPCode_93009', 'ZIPCode_93010', 'ZIPCode_93014', 'ZIPCode_93022', 'ZIPCode_93023', 'ZIPCode_93033', 'ZIPCode_93063', 'ZIPCode_93065', 'ZIPCode_93077', 'ZIPCode_93101', 'ZIPCode_93105', 'ZIPCode_93106', 'ZIPCode_93107', 'ZIPCode_93108', 'ZIPCode_93109', 'ZIPCode_93111', 'ZIPCode_93117', 'ZIPCode_93118', 'ZIPCode_93302', 'ZIPCode_93305', 'ZIPCode_93311', 'ZIPCode_93401', 'ZIPCode_93403', 'ZIPCode_93407', 'ZIPCode_93437', 'ZIPCode_93460', 'ZIPCode_93524', 'ZIPCode_93555', 'ZIPCode_93561', 'ZIPCode_93611', 'ZIPCode_93657', 'ZIPCode_93711', 'ZIPCode_93720', 'ZIPCode_93727', 'ZIPCode_93907', 'ZIPCode_93933', 'ZIPCode_93940', 'ZIPCode_93943', 'ZIPCode_93950', 'ZIPCode_93955', 'ZIPCode_94002', 'ZIPCode_94005', 'ZIPCode_94010', 'ZIPCode_94015', 'ZIPCode_94019', 'ZIPCode_94022', 'ZIPCode_94024', 'ZIPCode_94025', 'ZIPCode_94028', 'ZIPCode_94035', 'ZIPCode_94040', 'ZIPCode_94043', 'ZIPCode_94061', 'ZIPCode_94063', 'ZIPCode_94065', 'ZIPCode_94066', 'ZIPCode_94080', 'ZIPCode_94085', 'ZIPCode_94086', 'ZIPCode_94087', 'ZIPCode_94102', 'ZIPCode_94104', 'ZIPCode_94105', 'ZIPCode_94107', 'ZIPCode_94108', 'ZIPCode_94109', 'ZIPCode_94110', 'ZIPCode_94111', 'ZIPCode_94112', 'ZIPCode_94114', 'ZIPCode_94115', 'ZIPCode_94116', 'ZIPCode_94117', 'ZIPCode_94118', 'ZIPCode_94122', 'ZIPCode_94123', 'ZIPCode_94124', 'ZIPCode_94126', 'ZIPCode_94131', 'ZIPCode_94132', 'ZIPCode_94143', 'ZIPCode_94234', 'ZIPCode_94301', 'ZIPCode_94302', 'ZIPCode_94303', 'ZIPCode_94304', 'ZIPCode_94305', 'ZIPCode_94306', 'ZIPCode_94309', 'ZIPCode_94402', 'ZIPCode_94404', 'ZIPCode_94501', 'ZIPCode_94507', 'ZIPCode_94509', 'ZIPCode_94521', 'ZIPCode_94523', 'ZIPCode_94526', 'ZIPCode_94534', 'ZIPCode_94536', 'ZIPCode_94538', 'ZIPCode_94539', 'ZIPCode_94542', 'ZIPCode_94545', 'ZIPCode_94546', 'ZIPCode_94550', 'ZIPCode_94551', 'ZIPCode_94553', 'ZIPCode_94555', 'ZIPCode_94558', 'ZIPCode_94566', 'ZIPCode_94571', 'ZIPCode_94575', 'ZIPCode_94577', 'ZIPCode_94583', 'ZIPCode_94588', 'ZIPCode_94590', 'ZIPCode_94591', 'ZIPCode_94596', 'ZIPCode_94598', 'ZIPCode_94604', 'ZIPCode_94606', 'ZIPCode_94607', 'ZIPCode_94608', 'ZIPCode_94609', 'ZIPCode_94610', 'ZIPCode_94611', 'ZIPCode_94612', 'ZIPCode_94618', 'ZIPCode_94701', 'ZIPCode_94703', 'ZIPCode_94704', 'ZIPCode_94705', 'ZIPCode_94706', 'ZIPCode_94707', 'ZIPCode_94708', 'ZIPCode_94709', 'ZIPCode_94710', 'ZIPCode_94720', 'ZIPCode_94801', 'ZIPCode_94803', 'ZIPCode_94806', 'ZIPCode_94901', 'ZIPCode_94904', 'ZIPCode_94920', 'ZIPCode_94923', 'ZIPCode_94928', 'ZIPCode_94939', 'ZIPCode_94949', 'ZIPCode_94960', 'ZIPCode_94965', 'ZIPCode_94970', 'ZIPCode_94998', 'ZIPCode_95003', 'ZIPCode_95005', 'ZIPCode_95006', 'ZIPCode_95008', 'ZIPCode_95010', 'ZIPCode_95014', 'ZIPCode_95020', 'ZIPCode_95023', 'ZIPCode_95032', 'ZIPCode_95035', 'ZIPCode_95037', 'ZIPCode_95039', 'ZIPCode_95045', 'ZIPCode_95051', 'ZIPCode_95053', 'ZIPCode_95054', 'ZIPCode_95060', 'ZIPCode_95064', 'ZIPCode_95070', 'ZIPCode_95112', 'ZIPCode_95120', 'ZIPCode_95123', 'ZIPCode_95125', 'ZIPCode_95126', 'ZIPCode_95131', 'ZIPCode_95133', 'ZIPCode_95134', 'ZIPCode_95135', 'ZIPCode_95136', 'ZIPCode_95138', 
'ZIPCode_95192', 'ZIPCode_95193', 'ZIPCode_95207', 'ZIPCode_95211', 'ZIPCode_95307', 'ZIPCode_95348', 'ZIPCode_95351', 'ZIPCode_95354', 'ZIPCode_95370', 'ZIPCode_95403', 'ZIPCode_95405', 'ZIPCode_95422', 'ZIPCode_95449', 'ZIPCode_95482', 'ZIPCode_95503', 'ZIPCode_95518', 'ZIPCode_95521', 'ZIPCode_95605', 'ZIPCode_95616', 'ZIPCode_95617', 'ZIPCode_95621', 'ZIPCode_95630', 'ZIPCode_95670', 'ZIPCode_95678', 'ZIPCode_95741', 'ZIPCode_95747', 'ZIPCode_95758', 'ZIPCode_95762', 'ZIPCode_95812', 'ZIPCode_95814', 'ZIPCode_95816', 'ZIPCode_95817', 'ZIPCode_95818', 'ZIPCode_95819', 'ZIPCode_95820', 'ZIPCode_95821', 'ZIPCode_95822', 'ZIPCode_95825', 'ZIPCode_95827', 'ZIPCode_95828', 'ZIPCode_95831', 'ZIPCode_95833', 'ZIPCode_95841', 'ZIPCode_95842', 'ZIPCode_95929', 'ZIPCode_95973', 'ZIPCode_96001', 'ZIPCode_96003', 'ZIPCode_96008', 'ZIPCode_96064', 'ZIPCode_96091', 'ZIPCode_96094', 'ZIPCode_96145', 'ZIPCode_96150', 'ZIPCode_96651', 'Education_2', 'Education_3']
plt.figure(figsize=(30, 30))
out = tree.plot_tree(model, feature_names=feature_names,filled=True,fontsize=8,node_ids=False,class_names=None)
plt.show()
Decision Rules:
# Text report showing the rules of a decision tree -
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- ZIPCode_90049 <= 0.50
|   |   |   |   |   |--- ZIPCode_92007 <= 0.50
|   |   |   |   |   |   |--- ZIPCode_93106 <= 0.50
|   |   |   |   |   |   |   |--- weights: [63.00, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_93106 > 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- ZIPCode_92007 > 0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- ZIPCode_90049 > 0.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Family > 3.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- CCAvg <= 2.40
|   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 2.40
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age > 32.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age > 60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- ZIPCode_91360 <= 0.50
|   |   |   |   |   |--- ZIPCode_92220 <= 0.50
|   |   |   |   |   |   |--- ZIPCode_94709 <= 0.50
|   |   |   |   |   |   |   |--- ZIPCode_92521 <= 0.50
|   |   |   |   |   |   |   |   |--- ZIPCode_91203 <= 0.50
|   |   |   |   |   |   |   |   |   |--- ZIPCode_94122 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- ZIPCode_94105 <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   |   |   |   |   |   |   |--- ZIPCode_94105 > 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- ZIPCode_94122 > 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- ZIPCode_91203 > 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- ZIPCode_92521 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- ZIPCode_94709 > 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- ZIPCode_92220 > 0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- ZIPCode_91360 > 0.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- Family <= 2.50
|   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |--- ZIPCode_90034 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [28.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_90034 > 0.50
|   |   |   |   |   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Income > 103.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |   |   |--- CCAvg <= 4.75
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- CCAvg > 4.75
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Education_3 > 0.50
|   |   |   |   |   |   |--- CCAvg <= 3.95
|   |   |   |   |   |   |   |--- ZIPCode_90277 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- ZIPCode_90277 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg > 3.95
|   |   |   |   |   |   |   |--- Income <= 107.00
|   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income > 107.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Education_2 > 0.50
|   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |--- Family > 2.50
|   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |--- ZIPCode_90245 <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 20.00] class: 1
|   |   |   |   |   |--- ZIPCode_90245 > 0.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Age > 57.50
|   |   |   |   |   |--- Income <= 97.50
|   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Income > 97.50
|   |   |   |   |   |   |--- ZIPCode_94606 <= 0.50
|   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_94606 > 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|--- Income > 116.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |   |--- Education_2 > 0.50
|   |   |   |   |--- weights: [0.00, 53.00] class: 1
|   |   |--- Education_3 > 0.50
|   |   |   |--- weights: [0.00, 62.00] class: 1
|   |--- Family > 2.50
|   |   |--- weights: [0.00, 154.00] class: 1
Model Feature Importance¶
- model.feature_importances_ is an attribute that ranks how "important" or "useful" each feature was in making predictions.
How it works: it assigns a score to every input feature. The scores are non-negative and sum to 1.0 (or 100%).
- A high score means the feature was used frequently to split the data and significantly reduced the impurity (high "Gini importance" or "gain").
- A low score means the feature had little to no impact on the final prediction.
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
feature_importance = pd.DataFrame(
model.feature_importances_, columns=["Importance"], index=X_train.columns).sort_values(by="Importance", ascending=False)
print(feature_importance)
               Importance
Income           0.308577
Family           0.246862
Education_2      0.165238
Education_3      0.144207
CCAvg            0.048662
...                   ...
ZIPCode_90036    0.000000
ZIPCode_90037    0.000000
ZIPCode_90041    0.000000
ZIPCode_90044    0.000000
ZIPCode_95135    0.000000

[477 rows x 1 columns]
One-hot encoding ZIPCode produced a large number of features, and almost none of them contribute anything to the prediction. Going forward, a feature-importance threshold will therefore be applied for each model, and only the important features will be shown.
filtered_features = feature_importance[feature_importance['Importance'] > 0.004]
print(filtered_features)
             Importance
Income         0.308577
Family         0.246862
Education_2    0.165238
Education_3    0.144207
CCAvg          0.048662
Age            0.024457
CD_Account     0.017273
plt.barh(range(len(filtered_features)), filtered_features['Importance'], color="violet", align="center")
plt.yticks(range(len(filtered_features)), [filtered_features.index[i] for i in range(len(filtered_features))])
plt.xlabel("Feature Importance")
plt.show()
Checking model performance on test data¶
classification_confusion_matrix(model, X_test, y_test)
dtree1_perf_test = classification_performance_metrics(model, X_test, y_test)
dtree1_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.984 | 0.879195 | 0.956204 | 0.916084 |
Model Performance Improvement¶
Pre-Pruning¶
General Iteration¶
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 9, 1)
max_leaf_nodes_values = [50, 75, 100, 150]
min_samples_split_values = [10, 30, 50, 70]
# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0
# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
for max_leaf_nodes in max_leaf_nodes_values:
for min_samples_split in min_samples_split_values:
# Initialize the tree with the current set of parameters
estimator = DecisionTreeClassifier(
max_depth=max_depth,
max_leaf_nodes=max_leaf_nodes,
min_samples_split=min_samples_split,
class_weight='balanced',
random_state=1
)
# Fit the model to the training data
estimator.fit(X_train, y_train)
# Make predictions on the training and test sets
y_train_pred = estimator.predict(X_train)
y_test_pred = estimator.predict(X_test)
# Calculate recall scores for training and test sets
train_recall_score = recall_score(y_train, y_train_pred)
test_recall_score = recall_score(y_test, y_test_pred)
# Calculate the absolute difference between training and test scores
score_diff = abs(train_recall_score - test_recall_score)
# Update the best estimator only when the current one has both a smaller
# train-test recall gap and a higher test recall
if (score_diff < best_score_diff) and (test_recall_score > best_test_score):
best_score_diff = score_diff
best_test_score = test_recall_score
best_estimator = estimator
# Print the best parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test recall score: {best_test_score}")
Best parameters found:
Max depth: 2
Max leaf nodes: 50
Min samples split: 10
Best test recall score: 1.0
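The manual triple loop above is easy to follow; for reference, scikit-learn's GridSearchCV performs an equivalent search with cross-validation built in. A sketch of that alternative (not the approach used in this notebook):
# equivalent search via GridSearchCV, scoring on recall with 5-fold cross-validation
from sklearn.model_selection import GridSearchCV
param_grid = {
    "max_depth": np.arange(2, 9, 1),
    "max_leaf_nodes": [50, 75, 100, 150],
    "min_samples_split": [10, 30, 50, 70],
}
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=1),
    param_grid, scoring="recall", cv=5, n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)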
Evaluating Pre-Pruned Tree (dtree2)¶
# Fit the best algorithm to the data.
dtree2 = best_estimator
dtree2.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=np.int64(2),
                       max_leaf_nodes=50, min_samples_split=10, random_state=1)
Checking performance on training data
classification_confusion_matrix(dtree2, X_train, y_train)
dtree2_perf_train = classification_performance_metrics(dtree2, X_train, y_train)
dtree2_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.790286 | 1.0 | 0.310798 | 0.474212 |
dtree2_complexity = model_complexity(dtree2)
dtree2_complexity
| Depth | Number of leaves | Number of nodes | |
|---|---|---|---|
| 0 | 2 | 4 | 7 |
Visualizing the Decision Tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
dtree2,
feature_names=feature_names,
filled=True,
fontsize=10,
node_ids=False,
class_names=None,
)
plt.show()
Decision Rules:
# Text report showing the rules of a decision tree -
print(tree.export_text(dtree2, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1344.67, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- weights: [64.61, 79.31] class: 1
|--- Income > 92.50
|   |--- Family <= 2.50
|   |   |--- weights: [298.20, 697.89] class: 1
|   |--- Family > 2.50
|   |   |--- weights: [42.52, 972.81] class: 1
Feature Importance
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
dtree2_feature_importance = pd.DataFrame(
dtree2.feature_importances_, columns=["Importance"], index=X_train.columns).sort_values(by="Importance", ascending=False)
filtered_dtree2_feature_importance = dtree2_feature_importance[dtree2_feature_importance['Importance'] > 0]
filtered_dtree2_feature_importance
| Importance | |
|---|---|
| Income | 0.876529 |
| CCAvg | 0.066940 |
| Family | 0.056531 |
plt.barh(
    range(len(filtered_dtree2_feature_importance)),
    filtered_dtree2_feature_importance["Importance"],
    color="violet",
    align="center",
)
plt.yticks(
    range(len(filtered_dtree2_feature_importance)),
    filtered_dtree2_feature_importance.index,
)
plt.xlabel("Feature Importance")
plt.show()
Checking performance on test data
classification_confusion_matrix(dtree2, X_test, y_test)
dtree2_perf_test = classification_performance_metrics(dtree2, X_test, y_test)
dtree2_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.779333 | 1.0 | 0.310417 | 0.473768 |
The precision score is very low. Although a recall of 1.0 is the ideal scenario (no potential converter is missed), we need to see whether other configurations keep recall high without precision dropping this far.
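One alternative lever, not pursued in this notebook (which instead re-examines the pruning grid below), is tuning the decision threshold on predict_proba to trade a little recall for precision. A sketch:
# precision achievable at thresholds that keep recall >= 0.95 (threshold-tuning sketch)
from sklearn.metrics import precision_recall_curve
proba = dtree2.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test.astype(int), proba)
mask = recalls[:-1] >= 0.95  # recalls[:-1] aligns with the thresholds array
print(precisions[:-1][mask].max() if mask.any() else "no threshold keeps recall >= 0.95")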
Detailed Analysis of Pre-Pruned Tree Metrics¶
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 9, 1)
max_leaf_nodes_values = [50, 75, 100, 150]
min_samples_split_values = [10, 30, 50, 70]
model_statistics = []
estimator_list = []
# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
for max_leaf_nodes in max_leaf_nodes_values:
for min_samples_split in min_samples_split_values:
# Initialize the tree with the current set of parameters
estimator = DecisionTreeClassifier(
max_depth=max_depth,
max_leaf_nodes=max_leaf_nodes,
min_samples_split=min_samples_split,
class_weight='balanced',
random_state=1
)
# Fit the model to the training data
estimator.fit(X_train, y_train)
estimator_list.append(estimator)
# Make predictions on the training and test sets
y_train_pred = estimator.predict(X_train)
y_test_pred = estimator.predict(X_test)
model_statistics.append({
'max_depth': max_depth,
'max_leaf_nodes': max_leaf_nodes,
'min_samples_split': min_samples_split,
'Accuracy_train': accuracy_score(y_train, y_train_pred),
'Accuracy_test': accuracy_score(y_test, y_test_pred),
'Recall_train': recall_score(y_train, y_train_pred),
'Recall_test': recall_score(y_test, y_test_pred),
'Precision_train': precision_score(y_train, y_train_pred),
'Precision_test': precision_score(y_test, y_test_pred),
'F1_train': f1_score(y_train, y_train_pred),
'F1_test': f1_score(y_test, y_test_pred),
'Acc_diff': abs(accuracy_score(y_train, y_train_pred) - accuracy_score(y_test, y_test_pred)),
'Recall_diff': abs(recall_score(y_train, y_train_pred) - recall_score(y_test, y_test_pred)),
'Precision_diff': abs(precision_score(y_train, y_train_pred) - precision_score(y_test, y_test_pred)),
'F1_diff': abs(f1_score(y_train, y_train_pred) - f1_score(y_test, y_test_pred)),
})
model_statistics_df = pd.DataFrame(model_statistics)
model_statistics_df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| max_depth | 112.0 | 5.000000 | 2.008989 | 2.000000 | 3.000000 | 5.000000 | 7.000000 | 8.000000 |
| max_leaf_nodes | 112.0 | 93.750000 | 37.141681 | 50.000000 | 68.750000 | 87.500000 | 112.500000 | 150.000000 |
| min_samples_split | 112.0 | 40.000000 | 22.461178 | 10.000000 | 25.000000 | 40.000000 | 55.000000 | 70.000000 |
| Accuracy_train | 112.0 | 0.921184 | 0.074288 | 0.790286 | 0.820857 | 0.963571 | 0.968000 | 0.986857 |
| Accuracy_test | 112.0 | 0.910714 | 0.076836 | 0.779333 | 0.801333 | 0.954667 | 0.959333 | 0.973333 |
| Recall_train | 112.0 | 0.985757 | 0.010938 | 0.969789 | 0.972810 | 0.984894 | 0.995468 | 1.000000 |
| Recall_test | 112.0 | 0.953260 | 0.021355 | 0.932886 | 0.939597 | 0.942953 | 0.959732 | 1.000000 |
| Precision_train | 112.0 | 0.630464 | 0.198343 | 0.310798 | 0.342217 | 0.726868 | 0.752887 | 0.877984 |
| Precision_test | 112.0 | 0.610155 | 0.186984 | 0.310417 | 0.327945 | 0.703518 | 0.729770 | 0.818713 |
| F1_train | 112.0 | 0.748975 | 0.167556 | 0.474212 | 0.505910 | 0.836434 | 0.853403 | 0.935028 |
| F1_test | 112.0 | 0.723532 | 0.155053 | 0.473768 | 0.487973 | 0.805634 | 0.821114 | 0.875000 |
| Acc_diff | 112.0 | 0.010469 | 0.005252 | 0.001143 | 0.008667 | 0.010000 | 0.012952 | 0.019524 |
| Recall_diff | 112.0 | 0.032498 | 0.020981 | 0.000000 | 0.013078 | 0.045297 | 0.045885 | 0.061072 |
| Precision_diff | 112.0 | 0.022900 | 0.018478 | 0.000381 | 0.009068 | 0.021057 | 0.024161 | 0.070970 |
| F1_diff | 112.0 | 0.025911 | 0.019627 | 0.000444 | 0.001639 | 0.031300 | 0.032373 | 0.067065 |
model_statistics_df.loc[model_statistics_df['F1_diff'].idxmin()]
max_depth             2.000000
max_leaf_nodes       50.000000
min_samples_split    10.000000
Accuracy_train        0.790286
Accuracy_test         0.779333
Recall_train          1.000000
Recall_test           1.000000
Precision_train       0.310798
Precision_test        0.310417
F1_train              0.474212
F1_test               0.473768
Acc_diff              0.010952
Recall_diff           0.000000
Precision_diff        0.000381
F1_diff               0.000444
Name: 0, dtype: float64
model_statistics_df[['Accuracy_train','Precision_train','Recall_train','F1_train']].plot(figsize=(12, 6), linewidth=2)
model_statistics_df[['Accuracy_test','Precision_test','Recall_test','F1_test']].plot(figsize=(12, 6), linewidth=2)
plt.grid(True, alpha=0.7)
plt.show()
We can see from the plots that the strongest test-set results cluster where:
- Precision > 0.7
- Recall > 0.95
- F1 score > 0.8

Note that simply minimizing F1_diff (as above) selects a degenerate tree with perfect recall but only ~0.31 precision, so the train-test gap alone is not a good selection criterion. Next, we filter for the trees that clear all three test-set thresholds simultaneously.
model_statistics_df[(model_statistics_df['Recall_test']>0.95) & (model_statistics_df['F1_test']>0.8) & (model_statistics_df['Precision_test']>0.7)]
| | max_depth | max_leaf_nodes | min_samples_split | Accuracy_train | Accuracy_test | Recall_train | Recall_test | Precision_train | Precision_test | F1_train | F1_test | Acc_diff | Recall_diff | Precision_diff | F1_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | 6 | 50 | 10 | 0.976286 | 0.962667 | 1.0 | 0.966443 | 0.799517 | 0.738462 | 0.888591 | 0.837209 | 0.013619 | 0.033557 | 0.061055 | 0.051381 |
| 68 | 6 | 75 | 10 | 0.976286 | 0.962667 | 1.0 | 0.966443 | 0.799517 | 0.738462 | 0.888591 | 0.837209 | 0.013619 | 0.033557 | 0.061055 | 0.051381 |
| 72 | 6 | 100 | 10 | 0.976286 | 0.962667 | 1.0 | 0.966443 | 0.799517 | 0.738462 | 0.888591 | 0.837209 | 0.013619 | 0.033557 | 0.061055 | 0.051381 |
| 76 | 6 | 150 | 10 | 0.976286 | 0.962667 | 1.0 | 0.966443 | 0.799517 | 0.738462 | 0.888591 | 0.837209 | 0.013619 | 0.033557 | 0.061055 | 0.051381 |
Since all four qualifying trees have identical performance, we pick the one with the least complexity (the smallest max_leaf_nodes); a programmatic version of this selection is sketched below.
#max_depth = 6
#max_leaf_nodes = 50
#min_samples_split = 10
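Rather than hard-coding the choice, the least complex qualifying tree can be recovered programmatically; a sketch using the frames built above:
# pick the least complex tree among those meeting all three test-set thresholds
qualifying = model_statistics_df[
    (model_statistics_df['Recall_test'] > 0.95)
    & (model_statistics_df['F1_test'] > 0.8)
    & (model_statistics_df['Precision_test'] > 0.7)
]
best_idx = qualifying.sort_values(['max_depth', 'max_leaf_nodes', 'min_samples_split']).index[0]
print(best_idx)  # expected to recover row 64 from the table above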
Evaluating Pre-Pruned Tree (dtree3)¶
# selecting the tree with max_depth = 6, max_leaf_nodes = 50, min_samples_split = 10
dtree3 = estimator_list[64]
dtree3.fit(X_train, y_train)  # already fitted during the grid search; refitting here is a harmless no-op
DecisionTreeClassifier(class_weight='balanced', max_depth=np.int64(6),
                       max_leaf_nodes=50, min_samples_split=10, random_state=1)
Checking on training data
classification_confusion_matrix(dtree3, X_train, y_train)
dtree3_perf_train = classification_performance_metrics(dtree3, X_train, y_train)
dtree3_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.976286 | 1.0 | 0.799517 | 0.888591 |
dtree3_complexity = model_complexity(dtree3)
dtree3_complexity
| | Depth | Number of leaves | Number of nodes |
|---|---|---|---|
| 0 | 6 | 27 | 53 |
Visualizing the Tree
plt.figure(figsize=(30, 20))
out = tree.plot_tree(
dtree3,
feature_names=feature_names,
filled=True,
fontsize=10,
node_ids=False,
class_names=None,
)
plt.show()
Decision Rules:
print(tree.export_text(dtree3, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1344.67, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- Mortgage <= 102.50
|   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |--- weights: [8.28, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 3.05
|   |   |   |   |   |   |--- weights: [21.54, 52.87] class: 1
|   |   |   |   |--- Mortgage > 102.50
|   |   |   |   |   |--- weights: [11.60, 0.00] class: 0
|   |   |   |--- CCAvg > 3.95
|   |   |   |   |--- weights: [23.19, 0.00] class: 0
|   |   |--- CD_Account > 0.50
|   |   |   |--- weights: [0.00, 26.44] class: 1
|--- Income > 92.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |--- CCAvg <= 3.21
|   |   |   |   |   |   |--- weights: [22.09, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 3.21
|   |   |   |   |   |   |--- weights: [2.76, 15.86] class: 1
|   |   |   |   |--- Income > 103.50
|   |   |   |   |   |--- weights: [239.11, 0.00] class: 0
|   |   |   |--- Education_2 > 0.50
|   |   |   |   |--- Income <= 110.00
|   |   |   |   |   |--- CCAvg <= 2.90
|   |   |   |   |   |   |--- weights: [12.70, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 2.90
|   |   |   |   |   |   |--- weights: [0.00, 10.57] class: 1
|   |   |   |   |--- Income > 110.00
|   |   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |   |--- weights: [3.87, 15.86] class: 1
|   |   |   |   |   |--- Income > 116.50
|   |   |   |   |   |   |--- weights: [0.00, 280.21] class: 1
|   |   |--- Education_3 > 0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [7.73, 0.00] class: 0
|   |   |   |   |--- CCAvg > 1.10
|   |   |   |   |   |--- Income <= 94.00
|   |   |   |   |   |   |--- weights: [2.21, 0.00] class: 0
|   |   |   |   |   |--- Income > 94.00
|   |   |   |   |   |   |--- weights: [7.73, 47.58] class: 1
|   |   |   |--- Income > 116.50
|   |   |   |   |--- weights: [0.00, 327.79] class: 1
|   |--- Family > 2.50
|   |   |--- Income <= 113.50
|   |   |   |--- CCAvg <= 2.80
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [24.85, 0.00] class: 0
|   |   |   |   |--- Income > 106.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- weights: [4.97, 0.00] class: 0
|   |   |   |   |   |--- Age > 28.50
|   |   |   |   |   |   |--- weights: [6.07, 31.72] class: 1
|   |   |   |--- CCAvg > 2.80
|   |   |   |   |--- ZIPCode_95123 <= 0.50
|   |   |   |   |   |--- ZIPCode_95134 <= 0.50
|   |   |   |   |   |   |--- weights: [2.21, 95.17] class: 1
|   |   |   |   |   |--- ZIPCode_95134 > 0.50
|   |   |   |   |   |   |--- weights: [0.55, 0.00] class: 0
|   |   |   |   |--- ZIPCode_95123 > 0.50
|   |   |   |   |   |--- weights: [0.55, 0.00] class: 0
|   |   |--- Income > 113.50
|   |   |   |--- ZIPCode_94608 <= 0.50
|   |   |   |   |--- Age <= 66.00
|   |   |   |   |   |--- ZIPCode_94105 <= 0.50
|   |   |   |   |   |   |--- weights: [1.66, 845.92] class: 1
|   |   |   |   |   |--- ZIPCode_94105 > 0.50
|   |   |   |   |   |   |--- weights: [0.55, 0.00] class: 0
|   |   |   |   |--- Age > 66.00
|   |   |   |   |   |--- weights: [0.55, 0.00] class: 0
|   |   |   |--- ZIPCode_94608 > 0.50
|   |   |   |   |--- weights: [0.55, 0.00] class: 0
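The largest positive-class leaf above (weights [1.66, 845.92]) corresponds approximately to Income > 113.5 together with Family > 2.5 (ignoring the small ZIP code and Age carve-outs). As a quick sanity check against the raw data, a sketch assuming the original dataframe is named data:
# conversion rate inside the dominant positive leaf's segment (assumed dataframe name: data)
segment = data[(data['Income'] > 113.5) & (data['Family'] > 2.5)]
print(len(segment), round(segment['Personal_Loan'].mean(), 3))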
Feature Importance
dtree3_feature_importance = pd.DataFrame(
dtree3.feature_importances_, columns=["Importance"], index=X_train.columns).sort_values(by="Importance", ascending=False)
filtered_dtree3_feature_importance = dtree3_feature_importance[dtree3_feature_importance['Importance'] > 0.003]
filtered_dtree3_feature_importance
| | Importance |
|---|---|
| Income | 0.656667 |
| Education_2 | 0.142307 |
| CCAvg | 0.091533 |
| Education_3 | 0.050106 |
| Family | 0.039750 |
| CD_Account | 0.007782 |
| Mortgage | 0.004957 |
| Age | 0.004344 |
plt.barh(
    range(len(filtered_dtree3_feature_importance)),
    filtered_dtree3_feature_importance['Importance'],
    color="violet",
    align="center",
)
plt.yticks(
    range(len(filtered_dtree3_feature_importance)),
    filtered_dtree3_feature_importance.index,
)
plt.xlabel("Feature Importance")
plt.show()
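Gini importance can overstate features that offer many candidate split points; permutation importance on the test set is a useful cross-check. A minimal sketch (not part of the original analysis):
import pandas as pd
from sklearn.inspection import permutation_importance

# shuffle each feature column and measure the resulting drop in test F1
result = permutation_importance(dtree3, X_test, y_test, scoring='f1', n_repeats=10, random_state=1)
perm_imp = pd.Series(result.importances_mean, index=X_test.columns).sort_values(ascending=False)
print(perm_imp.head(8))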
Checking on Test Data
classification_confusion_matrix(dtree3, X_test, y_test)
dtree3_perf_test = classification_performance_metrics(dtree3, X_test, y_test)
dtree3_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.962667 | 0.966443 | 0.738462 | 0.837209 |
Post-pruning¶
The second pruning method (post-pruning) is assessed in this section. The tree is first grown fully and then pruned back according to the cost-complexity parameter, alpha. Cost-complexity pruning weighs the increase in error against the number of nodes removed (the complexity) for each pruned subtree, and yields the complete path of effective alpha values for pruning the tree.
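For reference, for a tree $T$ with $|\tilde{T}|$ terminal nodes and total leaf misclassification (or impurity) $R(T)$, cost-complexity pruning minimizes

$$R_\alpha(T) = R(T) + \alpha\,|\tilde{T}|$$

so larger values of $\alpha$ penalize complexity more heavily and yield smaller trees; this is the quantity behind scikit-learn's cost_complexity_pruning_path used below.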
clf = DecisionTreeClassifier(random_state=1) # instantiating a new decision tree
path = clf.cost_complexity_pruning_path(X_train, y_train) # computing the cost complexity pruning path over the training data
ccp_alphas, impurities = path.ccp_alphas, path.impurities
Without any threshold specified, pruning will continue until there is only one node (root node) left in the tree
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000276 | 0.000552 |
| 2 | 0.000279 | 0.002224 |
| 3 | 0.000381 | 0.002605 |
| 4 | 0.000476 | 0.003081 |
| 5 | 0.000500 | 0.003581 |
| 6 | 0.000513 | 0.007174 |
| 7 | 0.000527 | 0.007701 |
| 8 | 0.000544 | 0.008246 |
| 9 | 0.000545 | 0.009882 |
| 10 | 0.000625 | 0.010507 |
| 11 | 0.000700 | 0.011207 |
| 12 | 0.000762 | 0.012731 |
| 13 | 0.000882 | 0.016260 |
| 14 | 0.000940 | 0.017200 |
| 15 | 0.001305 | 0.018505 |
| 16 | 0.001647 | 0.020153 |
| 17 | 0.002333 | 0.022486 |
| 18 | 0.002407 | 0.024893 |
| 19 | 0.003294 | 0.028187 |
| 20 | 0.006473 | 0.034659 |
| 21 | 0.025146 | 0.084951 |
| 22 | 0.039216 | 0.124167 |
| 23 | 0.047088 | 0.171255 |
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post") # plotting effective alpha vs total impurity, leaving out the last value with a single node
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity")
plt.show()
clfs = []
for ccp_alpha in ccp_alphas:  # looping through all ccp_alpha values
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)  # one tree per alpha on the path
    clf.fit(X_train, y_train)
    clfs.append(clf)
# number of nodes in the last tree
print(f"Root tree nodes, ccp_alpha: {clfs[-1].tree_.node_count}, {float(ccp_alphas[-1])}")
Root tree nodes, ccp_alpha: 1, 0.04708834100596766
Following is a plot of alpha against:
- Depth of the tree
- Number of nodes in the tree

Number of nodes and depth are both useful gauges of the complexity of the tree. Increasing alpha reduces complexity, i.e., both the depth and the number of nodes shrink. We are looking for a balanced value at which the resulting tree is not too complex, yet does not lose too much of its predictive performance.
clfs = clfs[:-1] # removing the last tree which is the trivial tree with only one node
ccp_alphas = ccp_alphas[:-1] # removing the last ccp_alpha which is for the tree with only one node
node_counts = [clf.tree_.node_count for clf in clfs] # getting node count for each tree
depth = [clf.tree_.max_depth for clf in clfs] # getting depth of each tree
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
Recall vs alpha for training and testing sets
# helper to evaluate a metric for every pruned tree along the alpha path
def scores_vs_alpha(metric_fn, X, y):
    return [metric_fn(y, clf.predict(X)) for clf in clfs]

recall_train = scores_vs_alpha(recall_score, X_train, y_train)
recall_test = scores_vs_alpha(recall_score, X_test, y_test)
f1_train = scores_vs_alpha(f1_score, X_train, y_train)
f1_test = scores_vs_alpha(f1_score, X_test, y_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1-score")
ax.set_title("F1 vs Alpha: Test & Training")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model_recall = np.argmax(recall_test)
best_model_recall = clfs[index_best_model_recall]
float(best_model_recall.ccp_alpha)
0.0
index_best_model_f1 = np.argmax(f1_test)
best_model_f1 = clfs[index_best_model_f1]
float(best_model_f1.ccp_alpha)
0.0007619047619047619
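Note that np.argmax returns the first index at the maximum, so ties are broken toward the smallest alpha, i.e., the most complex tree. A possible refinement (a sketch, not used below) is to prefer the largest alpha whose test F1 is within a small tolerance of the best:
tolerance = 0.005  # assumed tolerance; tune to taste
best_f1 = max(f1_test)
candidates = [i for i, v in enumerate(f1_test) if v >= best_f1 - tolerance]
simplest = max(candidates, key=lambda i: ccp_alphas[i])  # largest alpha = most pruned tree
print(float(ccp_alphas[simplest]))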
# building decision trees with the best ccp_alpha values according to recall and F1-score,
# including the class_weight parameter to encode the relative importance of the classes
dtree4 = DecisionTreeClassifier(ccp_alpha=best_model_recall.ccp_alpha, class_weight={0: 0.15, 1: 0.85}, random_state=1)
dtree4.fit(X_train, y_train)
dtree5 = DecisionTreeClassifier(ccp_alpha=best_model_f1.ccp_alpha, class_weight={0: 0.15, 1: 0.85}, random_state=1)
dtree5.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=np.float64(0.0007619047619047619),
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)
Training & Evaluating Tree on Alpha value with best Recall¶
Checking performance on training data
classification_confusion_matrix(dtree4, X_train, y_train)
dtree4_perf_train = classification_performance_metrics(dtree4, X_train, y_train)
dtree4_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
dtree4_complexity = model_complexity(dtree4)
dtree4_complexity
| | Depth | Number of leaves | Number of nodes |
|---|---|---|---|
| 0 | 16 | 72 | 143 |
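The best-recall alpha was 0.0, so dtree4 is effectively an unpruned, fully grown tree (trained with the class weights above), which explains the perfect training scores. A quick confirmation, as a sketch:
print(dtree4.ccp_alpha)  # 0.0: no cost-complexity pruning was applied
print(dtree4.get_depth(), dtree4.get_n_leaves())  # should match the complexity table above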
Visualizing the Decision Tree
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
dtree4,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
plt.show()
Decision Rules:
# Text report showing the rules of the decision tree
print(tree.export_text(dtree4, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [374.10, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.50
|   |   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |   |--- Online > 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- CCAvg > 3.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- Age > 36.50
|   |   |   |   |   |   |--- ZIPCode_91203 <= 0.50
|   |   |   |   |   |   |   |--- ZIPCode_93907 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [6.45, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_93907 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_91203 > 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |--- Income > 81.50
|   |   |   |   |   |--- Mortgage <= 152.00
|   |   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg > 3.05
|   |   |   |   |   |   |   |   |--- ZIPCode_95616 <= 0.50
|   |   |   |   |   |   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |   |   |   |   |   |--- ZIPCode_95521 <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |   |   |--- ZIPCode_95521 > 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Age > 63.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ZIPCode_95616 > 0.50
|   |   |   |   |   |   |   |   |   |--- Age <= 42.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Age > 42.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- Securities_Account > 0.50
|   |   |   |   |   |   |   |--- ZIPCode_95812 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_95812 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |--- Mortgage > 152.00
|   |   |   |   |   |   |--- ZIPCode_94550 <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.90, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_94550 > 0.50
|   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |--- CCAvg > 3.95
|   |   |   |   |--- weights: [6.75, 0.00] class: 0
|   |   |--- CD_Account > 0.50
|   |   |   |--- CCAvg <= 4.50
|   |   |   |   |--- weights: [0.00, 6.80] class: 1
|   |   |   |--- CCAvg > 4.50
|   |   |   |   |--- weights: [0.15, 0.00] class: 0
|--- Income > 98.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |--- CCAvg <= 4.20
|   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 4.20
|   |   |   |   |   |   |--- ZIPCode_90245 <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |--- ZIPCode_90245 > 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |--- Income > 100.00
|   |   |   |   |   |--- ZIPCode_91367 <= 0.50
|   |   |   |   |   |   |--- ZIPCode_90245 <= 0.50
|   |   |   |   |   |   |   |--- weights: [66.00, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_90245 > 0.50
|   |   |   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_91367 > 0.50
|   |   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- Securities_Account > 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |--- Education_2 > 0.50
|   |   |   |   |--- Income <= 110.00
|   |   |   |   |   |--- weights: [1.80, 0.00] class: 0
|   |   |   |   |--- Income > 110.00
|   |   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |   |--- Mortgage <= 141.50
|   |   |   |   |   |   |   |--- Income <= 114.50
|   |   |   |   |   |   |   |   |--- Age <= 48.50
|   |   |   |   |   |   |   |   |   |--- ZIPCode_91125 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |   |   |   |   |--- ZIPCode_91125 > 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age > 48.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income > 114.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage > 141.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- Income > 116.50
|   |   |   |   |   |   |--- weights: [0.00, 45.05] class: 1
|   |   |--- Education_3 > 0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [1.95, 0.00] class: 0
|   |   |   |   |--- CCAvg > 1.10
|   |   |   |   |   |--- Age <= 41.50
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- CCAvg <= 2.25
|   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg > 2.25
|   |   |   |   |   |   |   |   |--- Mortgage <= 183.00
|   |   |   |   |   |   |   |   |   |--- Age <= 30.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |   |   |--- Age > 30.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |   |--- Mortgage > 183.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- Online > 0.50
|   |   |   |   |   |   |   |--- weights: [0.75, 0.00] class: 0
|   |   |   |   |   |--- Age > 41.50
|   |   |   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- Income > 100.00
|   |   |   |   |   |   |   |--- ZIPCode_94086 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.10] class: 1
|   |   |   |   |   |   |   |--- ZIPCode_94086 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |--- Income > 116.50
|   |   |   |   |--- weights: [0.00, 52.70] class: 1
|   |--- Family > 2.50
|   |   |--- Income <= 113.50
|   |   |   |--- CCAvg <= 2.75
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [3.90, 0.00] class: 0
|   |   |   |   |--- Income > 106.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |   |   |--- Age > 28.50
|   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |--- weights: [0.90, 0.00] class: 0
|   |   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |   |   |--- ZIPCode_94304 <= 0.50
|   |   |   |   |   |   |   |   |   |--- ZIPCode_94305 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- ZIPCode_95054 <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 5.10] class: 1
|   |   |   |   |   |   |   |   |   |   |--- ZIPCode_95054 > 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- ZIPCode_94305 > 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ZIPCode_94304 > 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age > 60.00
|   |   |   |   |   |   |   |   |--- ZIPCode_92096 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ZIPCode_92096 > 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |--- CCAvg > 2.75
|   |   |   |   |--- Age <= 57.00
|   |   |   |   |   |--- ZIPCode_90245 <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 11.90] class: 1
|   |   |   |   |   |--- ZIPCode_90245 > 0.50
|   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |--- Age > 57.00
|   |   |   |   |   |--- ZIPCode_95616 <= 0.50
|   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_95616 > 0.50
|   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |--- Income > 113.50
|   |   |   |--- ZIPCode_94608 <= 0.50
|   |   |   |   |--- Age <= 66.00
|   |   |   |   |   |--- ZIPCode_94105 <= 0.50
|   |   |   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |   |   |--- CCAvg <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg > 2.50
|   |   |   |   |   |   |   |   |--- ZIPCode_90245 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 5.10] class: 1
|   |   |   |   |   |   |   |   |--- ZIPCode_90245 > 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- Income > 116.50
|   |   |   |   |   |   |   |--- weights: [0.00, 130.90] class: 1
|   |   |   |   |   |--- ZIPCode_94105 > 0.50
|   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |--- Age > 66.00
|   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |--- ZIPCode_94608 > 0.50
|   |   |   |   |--- weights: [0.15, 0.00] class: 0
# importance of features in the tree building: the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature (also known as the Gini importance)
post_prune_recall_importances = pd.DataFrame(dtree4.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by="Imp", ascending=False)
filtered_pp_imp_recall = post_prune_recall_importances[post_prune_recall_importances['Imp'] > 0.005]
filtered_pp_imp_recall
| | Imp |
|---|---|
| Income | 0.591613 |
| Education_2 | 0.136686 |
| CCAvg | 0.078673 |
| Education_3 | 0.067619 |
| Family | 0.065419 |
| Age | 0.019640 |
| CD_Account | 0.011000 |
| Mortgage | 0.005821 |
plt.barh(range(len(filtered_pp_imp_recall)), filtered_pp_imp_recall['Imp'], color="violet", align="center")
plt.yticks(range(len(filtered_pp_imp_recall)), filtered_pp_imp_recall.index)
plt.xlabel("Feature Importance")
plt.show()
Checking performance on test data
classification_confusion_matrix(dtree4, X_test, y_test)
dtree4_perf_test = classification_performance_metrics(dtree4, X_test, y_test)
dtree4_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.980667 | 0.872483 | 0.928571 | 0.899654 |
Training & Evaluating Tree on Alpha value with best F1-Score¶
classification_confusion_matrix(dtree5, X_train, y_train)
dtree5_perf_train = classification_performance_metrics(dtree5, X_train, y_train)
dtree5_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.988286 | 1.0 | 0.889785 | 0.941679 |
dtree5_complexity = model_complexity(dtree5)
dtree5_complexity
| | Depth | Number of leaves | Number of nodes |
|---|---|---|---|
| 0 | 7 | 30 | 59 |
Visualizing the Tree
plt.figure(figsize=(15, 10))
out = tree.plot_tree(
dtree5,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
plt.show()
Decision Rules:
print(tree.export_text(dtree5, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [374.10, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |--- weights: [0.15, 1.70] class: 1
|   |   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- Age > 36.50
|   |   |   |   |   |   |--- ZIPCode_91203 <= 0.50
|   |   |   |   |   |   |   |--- weights: [6.60, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_91203 > 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |--- Income > 81.50
|   |   |   |   |   |--- Mortgage <= 152.00
|   |   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |   |--- weights: [2.70, 9.35] class: 1
|   |   |   |   |   |   |--- Securities_Account > 0.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- Mortgage > 152.00
|   |   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |--- CCAvg > 3.95
|   |   |   |   |--- weights: [6.75, 0.00] class: 0
|   |   |--- CD_Account > 0.50
|   |   |   |--- weights: [0.15, 6.80] class: 1
|--- Income > 98.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |--- CCAvg <= 4.20
|   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 4.20
|   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |--- Income > 100.00
|   |   |   |   |   |--- ZIPCode_91367 <= 0.50
|   |   |   |   |   |   |--- weights: [67.05, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_91367 > 0.50
|   |   |   |   |   |   |--- weights: [0.15, 0.85] class: 1
|   |   |   |--- Education_2 > 0.50
|   |   |   |   |--- Income <= 110.00
|   |   |   |   |   |--- weights: [1.80, 0.00] class: 0
|   |   |   |   |--- Income > 110.00
|   |   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |   |--- Mortgage <= 141.50
|   |   |   |   |   |   |   |--- weights: [0.45, 2.55] class: 1
|   |   |   |   |   |   |--- Mortgage > 141.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- Income > 116.50
|   |   |   |   |   |   |--- weights: [0.00, 45.05] class: 1
|   |   |--- Education_3 > 0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [1.95, 0.00] class: 0
|   |   |   |   |--- CCAvg > 1.10
|   |   |   |   |   |--- Age <= 41.50
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.45, 1.70] class: 1
|   |   |   |   |   |   |--- Online > 0.50
|   |   |   |   |   |   |   |--- weights: [0.75, 0.00] class: 0
|   |   |   |   |   |--- Age > 41.50
|   |   |   |   |   |   |--- weights: [0.30, 5.10] class: 1
|   |   |   |--- Income > 116.50
|   |   |   |   |--- weights: [0.00, 52.70] class: 1
|   |--- Family > 2.50
|   |   |--- Income <= 113.50
|   |   |   |--- CCAvg <= 2.75
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [3.90, 0.00] class: 0
|   |   |   |   |--- Income > 106.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |   |   |--- Age > 28.50
|   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |--- weights: [0.90, 0.00] class: 0
|   |   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |   |--- weights: [0.75, 5.10] class: 1
|   |   |   |--- CCAvg > 2.75
|   |   |   |   |--- Age <= 57.00
|   |   |   |   |   |--- weights: [0.15, 11.90] class: 1
|   |   |   |   |--- Age > 57.00
|   |   |   |   |   |--- weights: [0.75, 0.00] class: 0
|   |   |--- Income > 113.50
|   |   |   |--- weights: [0.90, 136.00] class: 1
Feature Importance
post_prune_f1_importances = pd.DataFrame(dtree5.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by="Imp", ascending=False)
filtered_pp_imp_f1 = post_prune_f1_importances[post_prune_f1_importances['Imp'] > 0.005]
filtered_pp_imp_f1
| | Imp |
|---|---|
| Income | 0.607964 |
| Education_2 | 0.140850 |
| CCAvg | 0.074937 |
| Education_3 | 0.067628 |
| Family | 0.066735 |
| Age | 0.012784 |
| CD_Account | 0.011335 |
| Mortgage | 0.005194 |
plt.barh(range(len(filtered_pp_imp_f1)), filtered_pp_imp_f1['Imp'], color="violet", align="center")
plt.yticks(range(len(filtered_pp_imp_f1)), filtered_pp_imp_f1.index)
plt.xlabel("Feature Importance")
plt.show()
Checking performance on test data
classification_confusion_matrix(dtree5, X_test, y_test)
dtree5_perf_test = classification_performance_metrics(dtree5, X_test, y_test)
dtree5_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.979333 | 0.919463 | 0.878205 | 0.898361 |
Model Performance Comparison and Final Model Selection¶
# training performance comparison
models_train_comp_df = pd.concat([dtree1_perf_train.T, dtree2_perf_train.T, dtree3_perf_train.T, dtree4_perf_train.T, dtree5_perf_train.T], axis=1)
models_train_comp_df.columns = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning-1)", "Decision Tree (Pre-Pruning-2)", "Decision Tree (Post-Pruning-1)", "Decision Tree (Post-Pruning-2)"]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree (sklearn default) | Decision Tree (Pre-Pruning-1) | Decision Tree (Pre-Pruning-2) | Decision Tree (Post-Pruning-1) | Decision Tree (Post-Pruning-2) |
|---|---|---|---|---|---|
| Accuracy | 1.0 | 0.790286 | 0.976286 | 1.0 | 0.988286 |
| Recall | 1.0 | 1.000000 | 1.000000 | 1.0 | 1.000000 |
| Precision | 1.0 | 0.310798 | 0.799517 | 1.0 | 0.889785 |
| F1 | 1.0 | 0.474212 | 0.888591 | 1.0 | 0.941679 |
# testing performance comparison
models_test_comp_df = pd.concat([dtree1_perf_test.T, dtree2_perf_test.T, dtree3_perf_test.T, dtree4_perf_test.T, dtree5_perf_test.T], axis=1)
models_test_comp_df.columns = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning-1)", "Decision Tree (Pre-Pruning-2)", "Decision Tree (Post-Pruning-1)", "Decision Tree (Post-Pruning-2)"]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| | Decision Tree (sklearn default) | Decision Tree (Pre-Pruning-1) | Decision Tree (Pre-Pruning-2) | Decision Tree (Post-Pruning-1) | Decision Tree (Post-Pruning-2) |
|---|---|---|---|---|---|
| Accuracy | 0.984000 | 0.779333 | 0.962667 | 0.980667 | 0.979333 |
| Recall | 0.879195 | 1.000000 | 0.966443 | 0.872483 | 0.919463 |
| Precision | 0.956204 | 0.310417 | 0.738462 | 0.928571 | 0.878205 |
| F1 | 0.916084 | 0.473768 | 0.837209 | 0.899654 | 0.898361 |
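A quick way to quantify overfitting across the candidates is the train-test F1 gap; a sketch using the two comparison frames above:
# positive values indicate the model does worse on unseen data
f1_gap = models_train_comp_df.loc["F1"] - models_test_comp_df.loc["F1"]
print(f1_gap.sort_values(ascending=False))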
# complexity comparison
models_complexity_df = pd.concat([dtree1_complexity.T, dtree2_complexity.T, dtree3_complexity.T, dtree4_complexity.T, dtree5_complexity.T], axis=1)
models_complexity_df.columns = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning-1)", "Decision Tree (Pre-Pruning-2)", "Decision Tree (Post-Pruning-1)", "Decision Tree (Post-Pruning-2)"]
print("Complexity Comparison")
models_complexity_df
Complexity Comparison
| | Decision Tree (sklearn default) | Decision Tree (Pre-Pruning-1) | Decision Tree (Pre-Pruning-2) | Decision Tree (Post-Pruning-1) | Decision Tree (Post-Pruning-2) |
|---|---|---|---|---|---|
| Depth | 15 | 2 | 6 | 16 | 7 |
| Number of leaves | 43 | 4 | 27 | 72 | 30 |
| Number of nodes | 85 | 7 | 53 | 143 | 59 |
Going by the criteria we set earlier:
- Recall is the most important metric, and only one tree (Decision Tree Pre-Pruning-1) achieves a recall of 1 on both training and test data.
- However, the precision of that tree is very low, meaning it flags a large number of false positives; each false positive incurs marketing expense (how much depends on the budget the business allocates to the campaign).
- Decision Tree Pre-Pruning-2 (dtree3) has adequate performance on recall, precision, and F1-score with a reasonable size, and is the recommended model (a sketch of putting it to use follows).
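To act on the goal of surfacing good conversion candidates, the recommended model can rank customers who have not yet bought a loan by their predicted conversion probability. A minimal sketch, where `X` (the full preprocessed feature matrix, encoded like X_train) and `data` (the original dataframe) are assumed names for objects from earlier cells:
# score every customer who did not buy a loan in the last campaign (assumed names: data, X)
not_converted = data['Personal_Loan'] == 0
proba = dtree3.predict_proba(X.loc[not_converted])[:, 1]  # probability of class 1 (accepts loan)
candidates = data.loc[not_converted].assign(loan_proba=proba)
print(candidates.sort_values('loan_proba', ascending=False)[['Income', 'Family', 'CCAvg', 'loan_proba']].head(10))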
Actionable Insights and Business Recommendations¶
Insights:
- High-income customers are more likely to take out personal loans, almost regardless of their other attributes.
- A visual inspection of the decision tree shows that Income is heavily favored for performing splits and decision making; the feature importance scores confirm this.
- There are high-income individuals (outliers) who have not yet converted to personal loans.
- High credit card spenders are also likely to apply for personal loans.
- Undergraduate (non-graduate) customers show little tendency to take out loans, even in combinations that are otherwise strong loan-applicant segments:
    - Undergraduates with high income
    - Undergraduates with high credit card spending
    - Undergraduates with high mortgages (a comparatively weaker signal than the two above)
- Similarly, small families (1-2 members) show little tendency to apply for loans:
    - High income and small family
    - High credit card spending and small family
    - High mortgage and small family (also a comparatively weaker signal)
- Younger individuals appear more likely to apply for loans.
Business Recommendations:
- Launch VIP/premium loan products for the high-income outliers, with benefits such as concierge processing or higher loan limits.
- Run targeted marketing campaigns for the undergraduate segment. Some of these individuals may still be paying off existing loans (and may have dropped out because of those payments), so alternative messaging that pivots away from words like "loan" and "debt" may help.
- Market "family expansion" loans to small families, targeting individuals who may have moved from 2-person to 3-person households.
- Target high credit card spenders with incentives such as lower monthly payments.
- Prioritize mobile app ads and social media platforms to reach the younger demographic.
- Target marketing towards individuals looking for a higher mortgage.